AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio. As a Data Scientist at AllLife Bank, I have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.

The objectives are to predict whether a liability customer will buy a personal loan or not, which variables are more significant, and which segment or customers should be targeted more.

The original dataset has about 5000 rows and 14 columns.

Negative values of Experience are not realistic and are treated in this work. They are possibly typos made during data collection.

Feature Engineering on the Experience column

Experience has now been cleaned by taking the absolute value of the number of years, so every value is non-negative.
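
The treatment described above can be sketched as follows. This is a minimal illustration on made-up rows, not the notebook's original cell: only the column names `Age` and `Experience` come from the dataset, while the sample values and the variable `n_negative` are assumptions.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the dataset.
df = pd.DataFrame({"Age": [24, 25, 45, 60], "Experience": [-1, -2, 20, 35]})

# Count the negative entries before treating them.
n_negative = int((df["Experience"] < 0).sum())

# Replace each value with its absolute value, so -2 becomes 2.
df["Experience"] = df["Experience"].abs()

print(n_negative, df["Experience"].tolist())
```

Taking the absolute value assumes the sign is the only error in those entries; if that assumption were doubtful, imputing from Age would be an alternative.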

Let's get value counts on each column

The zip codes take many different values and could possibly be a key variable in predicting personal loan users. So, preprocessing of ZIPCode is needed.

Personal loan users and non-users are not balanced in the dataset. So, recall is a better metric than accuracy for evaluating the models.
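
A small worked example shows why accuracy is misleading here. The labels below are hypothetical (1000 customers with a 9% positive rate, loosely echoing the campaign's conversion rate), and the "model" is a deliberately naive one that never predicts a purchase.

```python
# Hypothetical labels: 1000 customers, 9% of whom bought the loan (1 = bought).
y_true = [1] * 90 + [0] * 910

# A naive "model" that predicts that nobody buys the loan.
y_pred = [0] * len(y_true)

true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)            # share of all predictions that are right
recall = true_pos / (true_pos + false_neg)  # share of actual buyers that were found

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

The naive model scores 91% accuracy while finding none of the buyers, which is exactly the failure that recall exposes.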

All the zip codes in the dataset belong to the same state, California (CA). So, the state derived from the zip codes is unlikely to help predict which clients accepted or rejected the personal loan.

Let's remove the column for ZIPCode to simplify the dataset

Let's transform some continuous variables into categorical variables, using the 25th percentile, 75th percentile, and maximum value as bin boundaries, to reduce the number of distinct values per feature.
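
One way to bin a continuous column at those boundaries is `pandas.cut`. The sketch below is an assumption about how the binning could be done, not the notebook's actual code: the income values, the new column name `Income_group`, and the labels `low`/`mid`/`high` are all hypothetical.

```python
import pandas as pd

# Hypothetical income values in dollars; not taken from the dataset.
df = pd.DataFrame(
    {"Income": [8000, 20000, 39000, 45000, 64000, 80000, 98000, 120000, 150000, 224000]}
)

# Bin edges at the minimum, 25th percentile, 75th percentile, and maximum.
q25, q75 = df["Income"].quantile([0.25, 0.75])
edges = [df["Income"].min(), q25, q75, df["Income"].max()]

# include_lowest=True keeps the minimum value inside the first bin.
df["Income_group"] = pd.cut(
    df["Income"], bins=edges, labels=["low", "mid", "high"], include_lowest=True
)

print(df)
```

The same pattern would apply to the other binned columns (Age, Experience, CCAvg, Mortgage), each with its own quantiles.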

Our new dataset becomes:

Univariate Analysis

More than 50% of the customers are between 36 and 55 years old, while customers below 36 and over 55 years old make up similar proportions of the dataset.

The age distribution is fairly symmetric and approximates a normal distribution.

More than 50% of the customers have 11 to 30 years of experience. Customers with less than 11 years and customers with over 30 years of experience make up similar proportions of the dataset.

Experience appears normally distributed.

Nearly 50% of the customers have their annual income within the range of 40000 to 98000 dollars.

Income is not normally distributed; it is right-skewed.

Family size is balanced within the dataset (each family size accounts for between 20% and 30% of the customers). However, family size is not normally distributed.

Bivariate Analysis

For every feature in the dataset, the majority of the customers did not buy a personal loan.

A larger proportion of customers with income over 98000 dollars bought a personal loan compared to the other income groups.

More customers with CCAvg over 2500 dollars bought a personal loan compared to the other CCAvg groups.

More customers with a CD_Account bought a personal loan compared to customers without a CD_Account.

Experience and Age show a strong positive correlation, followed by CCAvg and Income.

Model Building - Approach

1. Data preparation
2. Partition the data into train and test sets.
3. Build a CART model on the train data.
4. Tune the model and prune the tree, if required.
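
Steps 2 and 3 above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions: the data are synthetic (generated with `make_classification` at roughly 9% positives) rather than the bank dataset, and the 70/30 split, the stratification, and `random_state=1` are choices made for the example, not values confirmed by the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared data, with about 9% positives.
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=4, weights=[0.91], random_state=1
)

# Step 2: partition into train and test sets, keeping the class ratio in both.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Step 3: build a CART model (scikit-learn's decision tree with Gini impurity).
tree = DecisionTreeClassifier(criterion="gini", random_state=1)
tree.fit(X_train, y_train)

train_recall = recall_score(y_train, tree.predict(X_train))
test_recall = recall_score(y_test, tree.predict(X_test))
print(f"train recall={train_recall:.3f}, test recall={test_recall:.3f}")
```

An unrestricted tree usually fits the training set almost perfectly, which is why step 4 (tuning and pruning) matters.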

Split Data

Build Decision Tree Model

Check model performance on test set

The model gives good, generalized results on the training and test sets.

Visualizing the Decision Tree

Education, income over 98000 dollars, family size, and CCAvg over 2500 dollars are the most significant variables for identifying potential customers who would purchase a personal loan.

Using GridSearch for Hyperparameter tuning of our tree model
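
A grid search that optimizes recall might look like the sketch below. The data are synthetic and the parameter grid is hypothetical; the notebook's actual grid is not shown in the text, so `max_depth` and `min_samples_leaf` are used here only as common examples.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared data, with about 9% positives.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4, weights=[0.91], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Hypothetical grid; the real search space is an assumption.
param_grid = {"max_depth": [2, 4, 6, None], "min_samples_leaf": [1, 5, 10]}

# scoring="recall" makes the search pick the combination with the best
# cross-validated recall on the positive class.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1), param_grid, scoring="recall", cv=5
)
search.fit(X_train, y_train)

best_tree = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```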

Checking performance on training set

Checking model performance on test set

Cost Complexity Pruning

Recall vs alpha for training and testing sets
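
The recall-versus-alpha comparison can be sketched with scikit-learn's `cost_complexity_pruning_path`. The data are again synthetic, and picking the alpha with the highest test recall is an assumption about how the final tree was chosen.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared data, with about 9% positives.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4, weights=[0.91], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Effective alphas at which the fully grown tree would lose a subtree.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train
)
# Drop the last alpha (it prunes down to a single node) and guard against
# tiny negative values caused by floating-point error.
ccp_alphas = np.clip(path.ccp_alphas[:-1], 0, None)

# Fit one tree per alpha and record (alpha, train recall, test recall).
results = []
for alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha)
    clf.fit(X_train, y_train)
    results.append(
        (
            float(alpha),
            recall_score(y_train, clf.predict(X_train)),
            recall_score(y_test, clf.predict(X_test)),
        )
    )

best_alpha, best_train_recall, best_test_recall = max(results, key=lambda r: r[2])
print(f"alpha={best_alpha:.5f}, test recall={best_test_recall:.3f}")
```

Plotting the second and third elements of `results` against the first gives the recall-versus-alpha curves for the two sets.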

Checking model performance on training set

Checking model performance on test set

Visualizing the Decision Tree

Education, income, family, CCAvg, and CD_Account remain the most important features after post-pruning. Age is no longer significant.

Comparing all the decision tree models

The decision tree with post-pruning gives the highest recall on the test set. The post-pruned tree is also not complex and is easy to interpret.

Business Insights (Based on Decision Tree)

Education, income group and family (in that order) are the most important variables in determining whether a liability customer will buy a personal loan or not.

Customers with income over $98000 are more likely to buy a personal loan.


Logistic Regression

Coefficient interpretations

Coefficients of Family, Education, CD_Account, and some categorical levels including Age group up to 55 years, Experience group between 11-30 years, Income group over 98000 dollars, CCAvg group over 2500 dollars and Mortgage group over 101,000 dollars are positive; an increase in these will lead to an increase in chances of a customer buying a personal loan.

Coefficients of Securities_Account, Online, CreditCard, and some categorical levels including Age group over 55 years, Experience group less than 11 years and Experience group over 30 years, income group less than $98,000, CCAvg group less than 2500 dollars, and Mortgage group less than 101,000 dollars are negative; an increase in these will lead to a decrease in chances of a customer buying a personal loan.

Converting coefficients to odds

Coefficient interpretations

Family: Holding all other features constant, a one-unit increase in Family size multiplies the odds of a customer buying a personal loan by 1.96, a 96.50% increase in the odds.

Education: Holding all other features constant, a one-unit increase in Education level multiplies the odds of a customer buying a personal loan by 5.36, a 436.07% increase in the odds.

Securities_Account: Holding all other features constant, a one-unit increase in Securities_Account multiplies the odds of a customer buying a personal loan by 0.36, a 63.77% decrease in the odds.

CD_Account: Holding all other features constant, a one-unit increase in CD_Account multiplies the odds of a customer buying a personal loan by 21.02, a 2002.47% increase in the odds.

Online: Holding all other features constant, a one-unit increase in Online (internet banking usage) multiplies the odds of a customer buying a personal loan by 0.74, a 25.79% decrease in the odds.
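
The arithmetic behind these statements is: odds ratio = exp(coefficient), and percent change in odds = (odds ratio - 1) * 100. The sketch below illustrates it. Because the fitted coefficients are not shown in the text, they are back-derived here from the quoted percentages, which is an assumption; the helper name `odds_summary` is also hypothetical.

```python
import math


def odds_summary(coef):
    """Return (odds ratio, percent change in odds) for a logistic regression coefficient."""
    odds = math.exp(coef)
    return odds, (odds - 1) * 100


# Percent changes quoted in the text; coefficients are recovered as
# log(1 + pct / 100), an assumption since the fitted values are not shown.
quoted_pct = {
    "Family": 96.50,
    "Education": 436.07,
    "Securities_Account": -63.77,
    "CD_Account": 2002.47,
    "Online": -25.79,
}
coefs = {name: math.log(1 + pct / 100) for name, pct in quoted_pct.items()}

for name, coef in coefs.items():
    odds, pct = odds_summary(coef)
    print(f"{name}: coef={coef:+.3f}, odds ratio={odds:.2f}, change={pct:+.2f}%")
```

A negative coefficient gives an odds ratio below 1, which is why Securities_Account and Online appear as decreases.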

Checking model performance on training set

ROC-AUC

ROC-AUC on training set

Model Performance Improvement

Let's see if the recall score can be improved further, by changing the model threshold using AUC-ROC Curve.
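
One common way to pick a threshold from the ROC curve is to maximize the gap between the true positive rate and the false positive rate (Youden's J statistic). The sketch below assumes that criterion and uses synthetic data; the text does not confirm how the notebook derived its threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared data, with about 9% positives.
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=4, weights=[0.91], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba_train = model.predict_proba(X_train)[:, 1]

# Each threshold on the ROC curve has a (fpr, tpr) pair; pick the one
# where tpr - fpr is largest.
fpr, tpr, thresholds = roc_curve(y_train, proba_train)
best_idx = int(np.argmax(tpr - fpr))
roc_threshold = float(thresholds[best_idx])

recall_default = recall_score(y_train, proba_train >= 0.5)
recall_tuned = recall_score(y_train, proba_train >= roc_threshold)
print(f"threshold={roc_threshold:.2f}, recall {recall_default:.2f} -> {recall_tuned:.2f}")
```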

Checking model performance on training set

Model performance has improved significantly on the training set. The model now gives a recall of 0.85 on the training set.

Let's use Precision-Recall curve and see if we can find a better threshold
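
A threshold can be read off the precision-recall curve at the point where precision and recall are closest, balancing the two. The sketch below assumes that criterion and uses synthetic data; it is an illustration, not the notebook's original cell.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared data, with about 9% positives.
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=4, weights=[0.91], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba_train = model.predict_proba(X_train)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_train, proba_train)

# precision and recall have one more element than thresholds, so drop the
# final point before searching for the smallest gap between the two.
gap = np.abs(precision[:-1] - recall[:-1])
pr_threshold = float(thresholds[int(np.argmin(gap))])
print(f"precision-recall threshold={pr_threshold:.2f}")
```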

Checking model performance on training set

Recall has improved as compared to the initial logistic regression model.

Model Performance Summary

Let's check the performance on the test set

Using the model with default threshold

ROC-AUC on test set

Using the model with threshold of 0.26

Using the model with threshold 0.34

Model performance comparison

Conclusion

By changing the threshold of the logistic regression model, we saw a significant improvement in model performance. The model achieved a recall of 0.85 on the training set with the threshold set at 0.26.

Comparing Logistic Regression and Decision Tree on Train Data

Comparing Logistic Regression and Decision Tree on Test Data

Conclusions and Recommendations

The Decision Tree model gave a higher recall, up to about 90.67%, compared to up to 80% for Logistic Regression.

In fact, before any form of improvement, the Decision Tree model gave a recall of up to 74.61%, while Logistic Regression gave a recall of 64.25%. Also, the Decision Tree model performs better on all the metrics (accuracy, recall, precision, and F1) considered in the analysis.

Hence, the Decision Tree model is recommended for the prediction of whether a liability customer will buy a personal loan or not.

In predicting whether a liability customer will buy a personal loan or not, the Education level, Family size, and Income level of the customer are the more significant variables. Customers with income above 98,000 dollars will most likely buy a personal loan. In both the Logistic Regression and Decision Tree models, a higher Education level, a larger Family size, and income above 98,000 dollars lead to a higher chance of a liability customer buying a personal loan.

So, it is advised that customer segments with higher education, larger family size, and income above 98,000 dollars should be targeted for personal loans.

Even though education and income showed a very low correlation with personal loan, they are the more important variables in predicting whether or not a customer takes a personal loan. So, bivariate analysis between the dependent and independent variables is not enough to determine which variables are of more relative importance in predicting the dependent variable.

Hence, it is highly recommended that a holistic analysis be conducted and a model developed in order to make better business decisions.

AllLife Bank should target customers based on their level of education. The higher the level of education, the more likely a customer is to seek a personal loan.

AllLife Bank should target customers with a larger family size for personal loans.

Customers with income above 98,000 dollars could be approached and advised to obtain a personal loan.

Adopting all three recommendations together (not just separately) will create a more robust tool for personal loan decision making.

Therefore, the objectives of this analysis, which are to predict whether a liability customer will buy a personal loan or not, which variables are more significant, and which segment of customers should be targeted more, have been achieved.